MED 117: A dataset of medicinal plants mostly found in Assam with their leaf images, segmented leaf frames and name table

Medicinal plants are a potential source of income, particularly for rural populations in India, who rely on them to treat a variety of diseases through both targeted temporary use and daily use. This data paper describes our collected specimen set, which stores leaf samples of 117 (one hundred seventeen) medicinal plant species. We visited many medicinal plant gardens in Assam to collect the samples and used the Mendeley platform to store the dataset. The dataset consists of raw leaf samples, U-Net segmented gray leaf samples and a plant name table. The table lists the botanical name, the family of the species, the common name and the Assamese name. A U-Net model was used for segmentation, and the resulting U-Net segmented gray image frames are uploaded to the database. These segmented samples can be used directly to train deep learning models for classification. Researchers will be able to use them to build recognition tools for Android or PC based systems.


Specifications Table
flower, bark and seeds can be used to cure most diseases. These plants are now commercially available and are treated as a source of national income.
• The leaf of a medicinal plant can be considered a reliable part for its identification, because flowers, fruits and seeds are not found in every species, whereas leaves generally are. Digital image processing techniques can be applied easily to leaf samples because of their structured orientation and features, which can be interpreted by machine learning segmentation algorithms as well as classification models.
• With this paper we provide a set of medicinal plant leaf samples of 117 species and the U-Net segmented image set, which other researchers can use for further classification and recognition. We also provide their scientific names, common names and Assamese names.
• User-friendly Android and PC based systems/applications can be developed to identify the plants from their leaves.

Objective
Leaves of plants are generally available throughout the year, whereas flowers and fruits appear seasonally. Leaves therefore provide a more consistent resource for identification year-round. Collecting the samples and making them suitable for work on a digital platform is difficult. The raw samples were collected in video format and were partitioned into static images for use in digital image processing algorithms. The dataset described here can help in the following ways:
• A majority of the medicinal plant leaf samples that are publicly available are stored here on a single platform. This will contribute to the process of natural resource identification. The Government of India has recently taken an interest in the large-scale cultivation of valued medicinal plants, so this small effort may help to improve the nation's economy.
• The table added to the dataset is a reliable source of information. It was prepared through discussion with experts and field executives.
• The raw leaf images are preprocessed using the U-Net segmentation technique. This segmented dataset is also made public and is accessible through the mentioned link. These samples can be used for image classification and further image processing research.
• There is no dedicated digital or web based system that can recognize a plant from its leaf. With the help of this sample set, a web based medicinal plant identification system can be built, which will reduce the effort of searching for and identifying a plant manually.

Data Description
The videos of the leaves were captured from the front side only. However, they were taken from different angles, and the camera was also rotated vertically during capture. Depending on the rotation, we could control the number of frames acquired from each video. Table 1 shows the number of frames collected from each species along with the family names. Video data augmentation was used to extend the sample size. We applied a number of transformations so that the dataset becomes voluminous enough for deep learning models. The transformations used are cropping, zooming in and out, shifting and rotation. Leaves of a particular plant mostly have the same shape, rib and needle orientation but differ in size. Fig. 1 shows a small set of leaf samples with various shapes and sizes.
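The four transformations named above can be illustrated with a minimal NumPy sketch. This is not the authors' augmentation pipeline: the function names, the nearest-neighbour zoom, the fill value and the restriction of rotation to multiples of 90° are all assumptions made to keep the example self-contained.

```python
import numpy as np

def crop(img, top, left, height, width):
    """Return the sub-image of the given size starting at (top, left)."""
    return img[top:top + height, left:left + width]

def shift(img, dy, dx, fill=0):
    """Translate the image by (dy, dx) pixels; uncovered pixels get `fill`."""
    h, w = img.shape[:2]
    out = np.full_like(img, fill)
    if abs(dy) >= h or abs(dx) >= w:
        return out  # shifted completely out of the frame
    src_y, dst_y = slice(max(0, -dy), min(h, h - dy)), slice(max(0, dy), min(h, h + dy))
    src_x, dst_x = slice(max(0, -dx), min(w, w - dx)), slice(max(0, dx), min(w, w + dx))
    out[dst_y, dst_x] = img[src_y, src_x]
    return out

def zoom(img, factor, fill=0):
    """Zoom about the image centre (factor > 1 zooms in, < 1 zooms out).

    Nearest-neighbour sampling; the output keeps the input size.
    """
    h, w = img.shape[:2]
    # Map every output coordinate back to its source coordinate.
    ys = np.round((np.arange(h) - (h - 1) / 2) / factor + (h - 1) / 2).astype(int)
    xs = np.round((np.arange(w) - (w - 1) / 2) / factor + (w - 1) / 2).astype(int)
    valid = ((ys >= 0) & (ys < h))[:, None] & ((xs >= 0) & (xs < w))[None, :]
    sampled = img[np.clip(ys, 0, h - 1)][:, np.clip(xs, 0, w - 1)]
    out = np.full_like(img, fill)
    out[valid] = sampled[valid]
    return out

def rotate90(img, k=1):
    """Rotate by k * 90 degrees counter-clockwise (right angles only)."""
    return np.rot90(img, k)

def augment(frame):
    """Produce a few illustrative variants of one frame (parameters are hypothetical)."""
    h, w = frame.shape[:2]
    return [
        crop(frame, h // 8, w // 8, 3 * h // 4, 3 * w // 4),
        zoom(frame, 1.25),
        zoom(frame, 0.8),
        shift(frame, h // 10, -(w // 10)),
        rotate90(frame, 1),
    ]
```

Calling `augment(frame)` on a single extracted frame returns five variants; arbitrary rotation angles and interpolated zoom would normally be delegated to an image library.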
This paper also describes the segmented dataset. The model used to segment the dataset is U-Net. Users can use this dataset directly for more extensive research work or can apply other segmentation methods to the raw frames. Table 1 lists all 117 medicinal plant species. It shows the botanical names, with the family name inside brackets. The total number of frames is listed in boldface. The plants are sorted in ascending alphabetical order and are read from left to right in each row. Researchers can access these data directly from the web link mentioned in the Data Accessibility section.

Sample Verification
We used a three-step verification process for sample authentication. We visited some of the renowned medicinal gardens, collected the leaf samples and identified them with the help of the caretakers and officials of the respective gardens. Both the leaf samples and snapshots of the plants with name plates (if available) were captured at those sites. We also consulted two renowned books on the medicinal plants of North East India.

Data Acquisition
The

Segmentation
Image segmentation is the process of breaking an image down into its constituent parts, called segments. It is required in image classification because it reduces the complexity of the image and prepares it for training a neural network model [3]. All similar picture elements (i.e., pixels) are collected into one group, so that different groups are formed from pixels with similar feature values. Segmentation removes background noise and unwanted regions from the image. We used the U-Net segmentation algorithm to segment the raw dataset, and the gray segmented dataset is uploaded to the subfolder "Segmented leaf set using UNET segmentation" on the Mendeley platform.
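As an illustration of how a segmentation result removes the background and yields a gray segmented frame, the following NumPy sketch applies a binary leaf mask to a grayscale version of a frame. The function names and the luminance weights (the common ITU-R BT.601 coefficients) are assumptions for the example; the paper does not state how its gray frames were produced from the U-Net output.

```python
import numpy as np

def to_gray(rgb):
    """Convert an H x W x 3 RGB image to 8-bit grayscale (BT.601 weights, assumed)."""
    weights = np.array([0.299, 0.587, 0.114])
    return np.clip(np.rint(rgb[..., :3] @ weights), 0, 255).astype(np.uint8)

def apply_mask(gray, mask):
    """Keep gray values where mask is non-zero (leaf); set the background to 0."""
    return np.where(mask > 0, gray, 0).astype(gray.dtype)

# Tiny hypothetical example: a 2 x 2 frame whose right column is background.
frame = np.array([[[255, 255, 255], [10, 20, 30]],
                  [[0, 128, 0], [200, 200, 200]]], dtype=np.uint8)
leaf_mask = np.array([[1, 0],
                      [1, 0]], dtype=np.uint8)
segmented = apply_mask(to_gray(frame), leaf_mask)
```

Only the pixels inside the mask keep their gray value, so the classifier sees the leaf region without background clutter.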

U-Net Model
U-Net was originally designed for the segmentation of biomedical images. This segmentation method helps to focus on the area of divergence. In general, a Convolutional Neural Network (CNN) is used for labeling, where the input is an image and the output is a label. In U-Net, classification is performed on every pixel, so the input and output images have the same size. Fig. 2 shows the basic U-Net architecture, which comprises a contracting path (which decreases the input size) and an expansive path (which increases it). The contracting path is a series of convolutional blocks; each block consists of two convolution layers, a Rectified Linear Unit (ReLU) activation function and a max pooling layer for down-sampling. Each down-sampling step doubles the number of feature channels [4].
The expansive path, on the other hand, begins each step with an up-sampling of the feature map. This is followed by a convolution (up-convolution) that halves the number of feature channels, a concatenation with the correspondingly cropped feature map from the contracting path, and a convolutional layer followed by a ReLU activation function. The goal of the expansive path is to semantically project the discriminative features (at lower resolution) learnt by the contracting path onto the pixel space (at higher resolution) to obtain a dense classification. Cropping is performed because border pixels are lost in each convolution step. The final layer uses a convolution to map the feature vectors to the required number of classes [5].
Up-sampling in the expansive path can be achieved using transposed convolution (deconvolution). Essentially, a transposed convolution is applied in the opposite direction of a normal convolution. Fig. 2 shows the architecture of our U-Net model, in which the input image size is 128 × 128 at 16 bit; down-sampling reaches 256 bits, and up-sampling then returns to the original 16 bits.
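The doubling and halving of feature channels described above can be traced with a short shape walk. The sketch below rests on assumptions rather than on the authors' stated configuration: it reads the figures 16 and 256 as feature-channel counts, and it assumes four down-sampling steps, 2 × 2 pooling, and padded convolutions so that no cropping is needed. Table 2 and Fig. 2 remain the authoritative description of the model.

```python
def unet_shapes(size=128, base_channels=16, depth=4):
    """List (stage, spatial size, feature channels) along a U-Net.

    Assumes each down step halves the spatial size and doubles the channels,
    and each up step does the reverse (hypothetical configuration).
    """
    shapes = [("input block", size, base_channels)]
    s, c = size, base_channels
    for level in range(1, depth + 1):       # contracting path
        s, c = s // 2, c * 2
        shapes.append((f"down {level}", s, c))
    for level in range(depth, 0, -1):       # expansive path
        s, c = s * 2, c // 2
        shapes.append((f"up {level}", s, c))
    return shapes

for stage, s, c in unet_shapes():
    print(f"{stage:12s} {s:3d} x {s:<3d} channels={c}")
```

Under these assumptions the bottleneck is 8 × 8 with 256 channels, and the last up-sampling step restores 128 × 128 with 16 channels before the final classification convolution.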
We selected 1812 images at random from our leaf dataset, taking 10 to 20 samples from each leaf class. These images were first segmented using the traditional watershed segmentation approach. After segmentation, we converted them to binary images using the thresholding function in the OpenCV library. These binary images serve as the target images when training the U-Net model. Table 2 shows a summary of the U-Net model that was used to segment the leaf samples.
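The conversion of a segmented image to a binary training target can be sketched as follows. The authors used OpenCV's thresholding function; this NumPy version only mirrors plain binary thresholding (values above the threshold become the maximum value, all others become 0) so that the example needs no extra dependency. The threshold of 127 and the function names are hypothetical, since the paper does not report the values used.

```python
import numpy as np

def binary_threshold(gray, thresh=127, maxval=255):
    """Binary thresholding: pixels > thresh become maxval, the rest become 0."""
    return np.where(gray > thresh, maxval, 0).astype(np.uint8)

def to_training_target(binary):
    """Convert a 0/255 mask to a float H x W x 1 array of 0.0/1.0 for training."""
    return (binary > 0).astype(np.float32)[..., np.newaxis]

# Hypothetical 2 x 3 watershed-segmented gray patch.
patch = np.array([[0, 100, 127],
                  [128, 200, 255]], dtype=np.uint8)
mask = binary_threshold(patch)
target = to_training_target(mask)
```

Each target then has one channel with values in {0, 1}, which is the usual form for a per-pixel binary classification loss.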
Among these 117 (one hundred seventeen) species, the segmented frames of Asparagus officinalis L. (Wild asparagus), Crinum viviparum (Lam.) (Indian-squill) and Bacopa monnieri (L.) Wettst. (Water hyssop) were not clearly recognizable. Asparagus officinalis L. (Wild asparagus) and Crinum viviparum (Lam.) (Indian-squill) were therefore excluded from the segmented dataset, but we retained Bacopa monnieri (L.) Wettst. (brahmi) in that folder. Table 3 shows segmented leaf images of eight medicinal plants that were initially segmented using the U-Net model but for which the resulting images were not good and clear. The watershed segmentation method was therefore applied to those resulting images, followed by masking. In Table 3, the first column holds the serial number, the second column lists the species name, the third column shows the U-Net segmented leaves of each listed species, and the fourth column shows the U-Net + watershed segmented leaf samples.

Ethics Statement
The research work described here involves neither human subjects nor animals. All the experiments were performed without harm to humans, animals, birds, insects, etc. All the images included in this article were collected by the researchers themselves; no image or species was taken from any electronic media, books, papers, journals, etc. We received funding for this work from ASTEC (Assam Science Technology & Environment Council), Govt. of Assam.

Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that have influenced or could have influenced the research reported here.